T5gemma2 #41834
Conversation
Hey! Is this a pre-release model? I don't see the checkpoints like …
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
cc @ArthurZucker in that case!
vasqu left a comment
Just an initial review from my side.
The 3 main issues I have are:
- We should definitely be able to have SWA (sliding-window attention) for bidirectional masks directly in the utils (see the sketch after this list).
- The processing fn for pixel values should not be passed around imo; it should be handled a level above.
- The encoder and decoder models should be wrapped under one language-model level.
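To make the first point concrete, here is a minimal, hypothetical sketch of what such a bidirectional sliding-window mask utility could look like. The function name and signature are assumptions for illustration, not the actual transformers masking API.

```python
# Hypothetical sketch of a bidirectional sliding-window attention mask;
# the function name and signature are illustrative, not transformers' API.
import torch

def bidirectional_sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask: True where attention is allowed.

    Unlike a causal SWA mask, each position may attend to tokens up to
    `window` positions away on *either* side.
    """
    idx = torch.arange(seq_len)
    # distance[i, j] = |j - i|; keep pairs within the window, no causal cut.
    distance = (idx[None, :] - idx[:, None]).abs()
    return distance <= window

# Example: with seq_len=6 and window=2, position 3 attends to positions 1..5.
mask = bidirectional_sliding_window_mask(6, 2)
```

The only difference from the usual causal SWA mask is the absolute value on the distance, which lets each position attend in both directions within the window.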
@vasqu Thanks for the comments. Re the major issues:
1. I agree with this move, but I think it is better done on the transformers side, since it's a big behavior change. wdyt?
2. This relates to how transformers handles generation for encoder-decoder models. Two abnormal behaviors: … This is the main reason we have some weird designs in T5Gemma2, including the dynamic adjustment of the sliding window size and the special handling of the vision preprocessor. Please let me know if you have any suggestions!
3. I'm not sure it's a good idea to have another level of wrapper for encoder-decoders, as it's common to put the encoder and decoder into the model jointly, like T5/Bart/T5Gemma.
vasqu left a comment
The key change should be how we handle caches and the encoder with vision preprocessing (get image features etc.).
Regarding the generation-related issues, for now we should override the respective functions in the generation mixin we inherit from (a rough sketch follows below). I agree that ideally we'd have proper logic in our code, but I want to postpone this for now and fix it properly in the future; small overrides are fine, and we already encounter these here and there.
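A rough sketch of what such an override could look like, assuming the hook names used in recent transformers versions (private methods like `_prepare_cache_for_generation` can change between releases, so treat this as illustrative only):

```python
# Hedged sketch: overriding a private GenerationMixin hook on the model
# class. The exact signature of _prepare_cache_for_generation varies
# across transformers versions; this class is schematic, not runnable
# as-is without a proper config/weights setup.
from transformers import PreTrainedModel
from transformers.generation import GenerationMixin

class MyEncoderDecoderForConditionalGeneration(PreTrainedModel, GenerationMixin):
    def _prepare_cache_for_generation(self, *args, **kwargs):
        # Let the default implementation set up the self-attention cache,
        # then attach/adjust the cross-attention cache that this
        # encoder-decoder model needs (model-specific logic would go here).
        return super()._prepare_cache_for_generation(*args, **kwargs)
```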
vasqu left a comment
Thank you, this is looking pretty good already. There are no big issues tbh, just a few smaller standards-related issues.
vasqu left a comment
Thanks a lot for iterating! I left some last comments, mostly about shortening / simplifying things.
If we could remove some of the test overrides, that would be awesome. Atm it's a lot, which is not ideal, but also not the end of the world.
cc @ArthurZucker @Cyrilvallez for core maintainer review. This is an encoder-decoder model with multimodal capabilities. To interact properly with our generation pipeline, the vision backbone sits within the encoder.
1. Override _prepare_cache_for_generation to take care of the cross-attention cache.
2. Move vision preprocessing from the main model to the encoder (see the sketch below).
3. Clean up and fix bugs in the modular model.
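As an illustration of point 2, here is a hedged sketch (all module and parameter names below are hypothetical, not the actual T5Gemma2 implementation) of an encoder that owns the vision tower, so raw pixel_values go straight to the encoder and no preprocessing function has to be threaded through the main model:

```python
# Illustrative sketch only; component names are hypothetical, not T5Gemma2's.
import torch
from torch import nn

class EncoderWithVision(nn.Module):
    def __init__(self, text_encoder: nn.Module, vision_tower: nn.Module,
                 projector: nn.Module):
        super().__init__()
        self.text_encoder = text_encoder
        self.vision_tower = vision_tower
        self.projector = projector

    def get_image_features(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Vision backbone + projection live inside the encoder, so callers
        # never pass a preprocessing fn around.
        return self.projector(self.vision_tower(pixel_values))

    def forward(self, inputs_embeds: torch.Tensor,
                pixel_values: torch.Tensor | None = None,
                image_token_mask: torch.Tensor | None = None):
        if pixel_values is not None:
            image_features = self.get_image_features(pixel_values)
            # Scatter projected image features into the placeholder
            # image-token positions of the text embeddings.
            inputs_embeds = inputs_embeds.masked_scatter(
                image_token_mask.unsqueeze(-1), image_features
            )
        return self.text_encoder(inputs_embeds)
```

This mirrors the common pattern in multimodal models of scattering projected image features into placeholder token positions, but keeps the whole pipeline behind the encoder's forward.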
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, t5gemma, t5gemma2
What does this PR do?
Add support for T5Gemma2 with multimodal and long-context capabilities.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.